Affordable Housing Survey in City of Charlottesville

Allen Lang

This notebook is part of a group collaboration among University of Virginia students, professors, and SmartCville that looks at the affordability of single-family housing in the City of Charlottesville over the past 20 years. The hope is that this project will yield insights into how difficult or easy it has been to obtain a single-family property across the City.

Outline

  1. Introduction
  2. Data Preprocessing
  3. Basic Visualization
  4. Classification into Neighborhood Planning Areas
  5. Time Trends
  6. Parcels' sale histories

1 Introduction

Back to outline

The point of this study is to observe housing prices in the City of Charlottesville. The following packages are imported: NumPy, GeoPandas, and pandas are used for data handling, and Folium is used for data visualization; Folium is a Python wrapper for the JavaScript library Leaflet.

Municipal Boundary Area: https://opendata.charlottesville.org/datasets/municipal-boundary-area

The Folium module adapts Leaflet maps to Python. The provided coordinates (location=38.0393, -78.4767) will be used to center all subsequent maps of the City of Charlottesville. Cville Open Data provides an approximate municipal boundary area in the form of a GeoJSON layer; the houses in this study fall within this boundary. Folium can render GeoJSON files directly as map layers.

2 Data Preprocessing

Back to outline

The repository https://opendata.charlottesville.org/ offers data on various utilities in the City of Charlottesville. The previous section's Municipal Boundary Area is part of the Property subsection, which contains various datasets. This section will extract the information that is useful for this study, formatting relevant portions of each table so that they are ready for Folium.

2.1 Reading/Modifying Sales Table

Source: https://opendata.charlottesville.org/datasets/real-estate-sales

The Sales Table contains data about sales of particular parcels: their addresses, dates of sale, and sale amounts. sales is a DataFrame created by reading in this CSV.

The Street Number and Street Address are combined into a single Address column for ease of use later on; the two separate component columns are then dropped.
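A minimal sketch of this column combination, using hypothetical column names that mirror the description (the real Sales CSV may name these fields slightly differently):

```python
import pandas as pd

# Toy stand-in for the Sales CSV (column names are illustrative)
sales = pd.DataFrame({
    "StreetNumber": ["617", "1200"],
    "StreetName": ["MAIN ST", "WERTLAND ST"],
    "SaleAmount": [250000, 410000],
})

# Combine the two components into a single Address column...
sales["Address"] = sales["StreetNumber"].str.strip() + " " + sales["StreetName"].str.strip()
# ...then drop the separate components
sales = sales.drop(columns=["StreetNumber", "StreetName"])
```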

formatDate(row) is an applied function that takes every row of the DataFrame and modifies the SaleDate. The time portion is removed since none of the dates appear to specify an actual time, and the slashes are replaced with dashes to better coincide with numpy's date objects. Rows that do not have proper dates are removed from sales, and the formatDate(row) function is applied. Comparing DataFrame shapes, only one row appeared to have an invalid date. Finally, the SaleDate column is converted into a pandas DateTime object.
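The date cleanup can be sketched like this (a minimal version, assuming an MM/DD/YYYY format with an empty time portion):

```python
import pandas as pd

# Toy SaleDate column, including one invalid value
sales = pd.DataFrame({"SaleDate": ["01/15/2005 00:00:00", "12/03/2019 00:00:00", "bad"]})

def format_date(value):
    date_part = value.split(" ")[0]        # remove the (always-empty) time portion
    return date_part.replace("/", "-")     # slashes -> dashes

sales["SaleDate"] = sales["SaleDate"].apply(format_date)
# errors="coerce" turns unparseable dates into NaT so they can be dropped
sales["SaleDate"] = pd.to_datetime(sales["SaleDate"], format="%m-%d-%Y", errors="coerce")
sales = sales.dropna(subset=["SaleDate"])
```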

2.2 Merging Sales with Residential

Source: https://opendata.charlottesville.org/datasets/real-estate-residential-details

The Residential datasheet contains a list of all the residential parcels of the area. This is used to filter out the non-residential parcels present in the Sales dataset. The datasheet has a column called "UseCode" that specifies the type of residential building on the parcel. Some of the labels in the column were found to be irrelevant to family housing (such as parking lots and vacant land). These parcels were filtered out of the dataset; of the fifty or so labels, about ten remain.

resid is the DataFrame after reading in this Residential csv.

salesResid is the result of merging sales and resid. Only parcels with UseCodes in the set labels are included. The join is done using the ParcelNumber column from each DataFrame.
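The filter-then-merge step can be sketched as below. The UseCode labels and parcel numbers here are illustrative, not the real values:

```python
import pandas as pd

# Toy versions of the two tables
sales = pd.DataFrame({"ParcelNumber": ["A1", "B2", "C3"],
                      "SaleAmount": [200000, 5000, 310000]})
resid = pd.DataFrame({"ParcelNumber": ["A1", "B2", "C3"],
                      "UseCode": ["Single Family", "Vacant Land", "Single Family"]})

# Keep only family-housing UseCodes (the set of ~10 retained labels)
labels = {"Single Family"}
resid = resid[resid["UseCode"].isin(labels)]

# Inner join on ParcelNumber drops sales of non-residential parcels
salesResid = sales.merge(resid, on="ParcelNumber", how="inner")
```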

2.3 Merging with Geocoded Addresses

The addresses in the table were then geocoded (latitude/longitude coordinates were found from the addresses) and stored in a .csv file called 'coordinates.csv'. This was merged with the combined salesResid table. geoSalesResid is the table that will be used for subsequent sections. As a baseline, parcels with SaleAmounts less than 100 were assumed to be typos or invalid sales and filtered out. Additionally, it was found that some parcels had sales with different 'RecordID' values but with the same exact 'SaleDate', 'SaleAmount', and 'ParcelNumber'. These duplicates were also filtered out.
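The final cleanup can be sketched as follows (toy data; the real table has many more columns):

```python
import pandas as pd

# Toy merged table: rows 1 and 2 are duplicate sales under different RecordIDs,
# and row 3 has an implausibly small SaleAmount
geoSalesResid = pd.DataFrame({
    "RecordID": [1, 2, 3, 4],
    "ParcelNumber": ["A1", "A1", "B2", "C3"],
    "SaleDate": ["2005-01-15", "2005-01-15", "2010-06-01", "2012-03-20"],
    "SaleAmount": [200000, 200000, 50, 310000],
})

# Drop sale amounts under 100 (assumed typos/invalid sales)
geoSalesResid = geoSalesResid[geoSalesResid["SaleAmount"] >= 100]
# Drop rows identical apart from RecordID
geoSalesResid = geoSalesResid.drop_duplicates(
    subset=["ParcelNumber", "SaleDate", "SaleAmount"])
```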

3 Basic Visualization

Back to outline

The goal of this section is to begin making visualizations of houses and their sale prices.

makeRecentSales(cutoffDate) is a function that creates a subset of geoSalesResid based on a cutoff date. It will only look at sales that happened after the specified date. recentSales is an example DataFrame containing sales from the beginning of 2019 to the end of 2019. The end of 2019 was chosen as the standard end date since 2020 had not yet finished.
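A minimal makeRecentSales-style filter might look like this, assuming a datetime SaleDate column (the fixed end date is an assumption based on the description above):

```python
import pandas as pd

def make_recent_sales(table, cutoff_date, end_date="2019-12-31"):
    # Keep only sales strictly after the cutoff and up to the end of 2019
    mask = ((table["SaleDate"] > pd.Timestamp(cutoff_date)) &
            (table["SaleDate"] <= pd.Timestamp(end_date)))
    return table[mask]

# Toy example
geoSalesResid = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2018-06-01", "2019-03-15", "2019-11-30"]),
    "SaleAmount": [250000, 300000, 420000],
})
recentSales = make_recent_sales(geoSalesResid, "2018-12-31")
```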

A Folium map can be generated with circle markers placed at each latitude/longitude coordinate. addMarkersSale is an applied function that reads the rows of the DataFrame and adds a circle marker for each. The color is chosen using the chooseColor function, which defines the sale cutoffs used on the map, following the Yellow/Orange/Red ColorBrewer gradient. The markers also show the Sale Date and Amount for each parcel. formatSalesNumber(v) is a helper function that adds commas every third digit of the sale amount.
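Sketches of the two helpers are below; the specific cutoffs and hex colors are illustrative placeholders, not the notebook's actual tiers:

```python
def format_sales_number(v):
    # Insert commas every third digit, e.g. 1234567 -> "1,234,567"
    return f"{int(v):,}"

def choose_color(sale):
    # Yellow/Orange/Red gradient tiers (cutoffs and colors are illustrative)
    if sale < 200_000:
        return "#ffffb2"   # light yellow
    elif sale < 400_000:
        return "#fd8d3c"   # orange
    else:
        return "#bd0026"   # dark red
```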

A Folium map is created, centered on Charlottesville. generateSalesMap() is simply a function that encloses this process of generating a map. The functions described above are used to add circle markers for every parcel that had a sale from the beginning of 2019 onward.

4 Classification into Neighborhood Planning Areas

Back to outline

Planning Neighborhood Area: https://opendata.charlottesville.org/datasets/planning-neighborhood-area

The goal of this section is to make use of special zones in the City. The Neighborhood Development Services (NDS) created a Cville OpenData file that lists "Planning Neighborhood Areas". Using this and the point-in-polygon problem, each property can now be classified into a planning neighborhood area based on the latitude and longitude coordinates obtained in section 2.3.

4.1 GeoJSON Layers

From the map in the previous section, it can be seen that there exist clusters of similar sale prices. JSON object files are a good way to represent geographical areas. Cville Open Data provides a set of objects called Planning Neighborhood Areas, and said set is used as a layer. generateJSONMap() produces a map of Charlottesville with these Planning Areas overlaid.

The GeoJSON file has a lot of additional information for each layer. The latitude/longitude coordinates of each layer are extracted from it. The variable lay is a list of coordinate lists.

To combine these individual parcel points with the imported JSON layer into a cohesive visualization, the Point in Polygon problem must be solved. How can we automate the process of determining which region a house belongs to? The next section will describe the algorithm that ultimately determines whether a parcel (point) is located within a particular JSON object (polygon).

4.2 Point in Polygon Algorithm

With the imported JSON polygon layers, each point from the previous section should be assigned to a polygon. This is a point-in-polygon problem: determine whether or not a point lies within a polygon defined by its vertices (a set of points).

A way to solve this problem is with ray-casting. Take a point and extend it "infinitely" along an arbitrary direction (it is now a ray). Count the number of times the ray intersects a side of the polygon. If the number of ray intersections is even, the point is not in the polygon; if it is odd, the point is in the polygon.

Solving PIP with Ray-casting

The first issue is, given two pairs of points (each pair defining a line segment), to determine whether the segments intersect. The solution is to check the orientations of these points (for intersecting segments, each segment's endpoints lie on opposite sides of the other segment). An algorithm written by Kite is used. The main function below is intersects(s1,s2), which takes in two tuples.

Note: All credit for the functions _onsegment(p,q,r), orientation(p,q,r), and intersects(s1,s2) goes to Kite [https://www.kite.com/python/answers/how-to-check-if-two-line-segments-intersect-in-python].

Now the point in polygon algorithm can be applied for this scenario.

The main function is pointinpolygons(row,layout), which will be applied across the entire table with each row (point) as its argument. It will return a series of booleans showing whether or not the point is in each of the polygons in the JSON file.

The makeray(point) function receives a point and returns a sequence with two tuples inside that represent the ray. For this, each point is extended horizontally. Additionally, since all points reside in Charlottesville, the left/right longitude are the west-most/east-most points of the area, respectively.

The pointinpolygon(polygons,raysegment) function is the individual function applied inside pointinpolygons. It creates a list of tuples containing the line segments of the polygon. It is assumed that the JSON file is configured so that the points can be connected in index order (index 0 to index 1, index 1 to index 2, etc.) until the very last, which connects back to index 0. It then counts the number of intersections as described in the ray-casting explanation above.
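The whole pipeline can be sketched in a self-contained form. This is a simplified re-implementation of the technique, not the notebook's exact code: the orientation-based intersection test follows the same idea as the credited Kite routine (general-position version; collinear overlaps would need the extra on-segment check mentioned in the source), and the east-most longitude default is a placeholder value.

```python
def orientation(p, q, r):
    # 0 = collinear, 1 = clockwise, 2 = counterclockwise
    val = (q[1] - p[1]) * (r[0] - q[0]) - (q[0] - p[0]) * (r[1] - q[1])
    return 0 if val == 0 else (1 if val > 0 else 2)

def intersects(s1, s2):
    # Segments cross when each segment's endpoints straddle the other segment
    (p1, q1), (p2, q2) = s1, s2
    return (orientation(p1, q1, p2) != orientation(p1, q1, q2) and
            orientation(p2, q2, p1) != orientation(p2, q2, q1))

def make_ray(point, east_most=-78.44):
    # Extend the point horizontally toward the east-most coordinate of the area
    # (the default here is a placeholder, not the notebook's actual bound)
    return (point, (east_most, point[1]))

def point_in_polygon(polygon, ray):
    # Connect consecutive vertices (last back to first) and count ray crossings;
    # an odd count means the point is inside
    edges = [(polygon[i], polygon[(i + 1) % len(polygon)])
             for i in range(len(polygon))]
    crossings = sum(intersects(edge, ray) for edge in edges)
    return crossings % 2 == 1
```

For example, with the square `[(0,0),(4,0),(4,4),(0,4)]`, a ray cast eastward from `(2,2)` crosses one edge (inside), while a ray from `(5,2)` crosses none (outside).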

The function is used to insert a column that assigns each point to its polygon region. If a point is not in any of the polygons, it is assigned 0. The recentSales table, which contained sales after 2018, gains a new column that categorizes each parcel into an area.

To verify the accuracy of the algorithm, the same table is used to create another map with the JSON layers. generateSalesMapWithJson() uses recentSales and filters out parcels without a region (classified as 0). In fact, no house in the dataset was assigned region 0, meaning that none fell outside the City of Charlottesville area.

4.3 Grouping the points

The pandas groupby function can be used to group all the points by region and determine the average sale amount within each region. salesByRegion is an example of grouping the entire recentSales table by region and finding the average of the sale prices to form a Series.
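The grouping step is a one-liner; a minimal sketch with toy data (the column name "Region" is assumed from the description above):

```python
import pandas as pd

# Toy recentSales table with a Region column from the point-in-polygon step
recentSales = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "SaleAmount": [200000, 300000, 400000, 500000, 600000],
})

# Average sale amount per region, as a Series indexed by region number
salesByRegion = recentSales.groupby("Region")["SaleAmount"].mean()
```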

A style function tells Folium how to format the JSON layers on the map; this is necessary to customize GeoJSON layers in Folium. The color tiers similarly follow the chooseColorSale() function from section 3.

4.4 Creation of the Map

Some helper functions are created to ensure that the main map generation function is not too cluttered.

  1. produceAGroup(t,func,label): Make a DataFrame based on grouping the Regions and calculating a value (mean/med/count)
  2. checkIfZero(s,l,n): Guards against regions with no sales; if a region has none, it simply returns 0.
  3. addMarkersSaleGroup(row,currmap,group): Add markers to map and to the markers group, color based on sales price and the label shows House type, price, and year.

The generateSalesMap(salesTable) function will do the following:

  1. Import the saved table that has run pointinpolygons
  2. Drop the redundant column at the beginning
  3. Group the salesTable by the regions and determine the mean/median/count of each. Make a DataFrame for all three with the numerical indices as a column. Uses produceAGroup(t,func,label) function
  4. Load JSON file of planning areas and update each layer with the three values. Uses checkIfZero(s,l,n) function
  5. Load JSON file of city boundary and add median/average/count to JSON
  6. Create a Folium Map and include additional Tile options
  7. FeatureGroup: Add a GeoJSON of city boundary that includes average/median of whole city
  8. FeatureGroup: Add the markers of individual parcels onto the map. Uses addMarkersSaleGroup(row,currmap,group) function
  9. FeatureGroup: Add a Choropleth using DataFrame from #3 and JSON from #4 based on averages
  10. FeatureGroup: Add a Choropleth using DataFrame from #3 and JSON from #4 based on median
  11. Add LayerControl and save map

This subsection separates the dual map into two maps, one for the mean and one for the median. This provides an alternative in case the dual, combined map appears too cluttered.

5 Trends over Time

Back to outline

The previous map collected sales without regard for the actual sale date, as all the sales over the last 20 years were included. This section and the next look at trends over time.

Folium includes plugins from the Leaflet library, some of which incorporate time as a variable in map visualizations. This section will create visuals that can be varied across the last 20 years. The two plugins used are TimeSliderChoropleth and HeatMapWithTime.

5.1 Aggregating Sales by Region and by Time

Pandas can easily group rows by values in a given column. The last 20 years of sales will be imported and grouped by year.

makeSalesTableByYear(t,y) is a helper function that returns a subset of the DataFrame in which only sales within a given range are included. plotyearlytrends() is a wrapper function that will import the dataframe of the 20 years of sales and then plot the average and median of each year's sales.
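The yearly aggregation behind plotyearlytrends() can be sketched as grouping on the year of SaleDate and taking the mean and median (toy data below, plotting omitted):

```python
import pandas as pd

# Toy sales table spanning two years
sales = pd.DataFrame({
    "SaleDate": pd.to_datetime(
        ["2018-03-01", "2018-09-15", "2019-05-20", "2019-11-02"]),
    "SaleAmount": [200000, 400000, 350000, 450000],
})

# Group by sale year and compute both the mean and the median per year
yearly = sales.groupby(sales["SaleDate"].dt.year)["SaleAmount"].agg(["mean", "median"])
```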

Now, it is time to look at each individual region per year. _mean_medbyregion() is a function that imports the saved tables of sales in the last 20 years (that includes which region it is located in). It will then group sales by the sale date year and by region and find the average and median of each group. This will then be inserted into two DataFrames that will be exported.

As seen from _avg_byregion, each region will have an average sale and median sale for every year from 2000 to 2020.

5.2 Timeslider Choropleth

A timeslider choropleth is similar to a regular choropleth in that both represent regional variables using a gradient. However, the former can take in timestamps (seconds elapsed from a fixed date) and produce a draggable slider for time. Two helper functions are made to produce the timeslider map.

chooseColorTime(sale) is a helper function that is analogous to those in previous sections in that it will determine the color of each choropleth object based on the sales range.

makestyledict(sale) is a function that produces a style dictionary detailing the color of each region. It uses chooseColorTime(sale) as a helper function to determine the color. It first converts the date from a datetime format (MM/DD/YYYY) to a unix integer that is the number of seconds since the epoch (1/1/1970). It uses dictionary comprehension to make a nested dictionary: outer keys are region numbers and inner keys are the time integers.
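A minimal makestyledict-style construction is sketched below. The outer keys are region numbers as strings (which TimeSliderChoropleth expects), the inner keys are unix-timestamp strings, and the sale cutoff in choose_color_time is an illustrative placeholder:

```python
import pandas as pd

def choose_color_time(sale):
    # Illustrative two-tier cutoff, not the notebook's actual tiers
    return "#ffffb2" if sale < 300_000 else "#bd0026"

# Toy input: region -> {date -> average sale that year}
avg_by_region = {
    1: {"2018-01-01": 250000, "2019-01-01": 350000},
}

# Nested dict comprehension: region -> {unix timestamp -> style dict}
styledict = {
    str(region): {
        str(int(pd.Timestamp(date).timestamp())): {
            "color": choose_color_time(sale), "opacity": 0.7}
        for date, sale in dates.items()
    }
    for region, dates in avg_by_region.items()
}
```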

generateTimesliderMap() is a wrapper function that reads in the geojson file of all the regions, produces a map of Charlottesville, and then creates a Timeslider.

5.3 Heat Map with Time

Another plugin produces a heat map that varies with time. The class takes in a list of lists of coordinates, where each element of the outer list corresponds to one time increment. The chosen time frame was years, so the outer list has 20 lists corresponding to each of the past 20 years from 2000 to 2019. produceList() is a helper function that produces this nested list.
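A sketch of a produceList()-style helper, assuming Latitude/Longitude columns (toy data; the real table is much larger):

```python
import pandas as pd

# Toy geocoded sales table
sales = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2000-05-01", "2000-08-10", "2019-03-03"]),
    "Latitude": [38.03, 38.04, 38.05],
    "Longitude": [-78.48, -78.47, -78.49],
})

def produce_list(table, start=2000, end=2019):
    # One inner list of [lat, lon] pairs per year, in chronological order
    years = table["SaleDate"].dt.year
    return [table.loc[years == y, ["Latitude", "Longitude"]].values.tolist()
            for y in range(start, end + 1)]

heat_data = produce_list(sales)   # ready for HeatMapWithTime
```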

6 Sales History Aggregation

Back to outline

The section below observes which parcels have had multiple sales throughout the years. A quick check of the table's dimensions shows that quite a few parcels have more than one sale in the records.

6.1 Creating a Sale History dictionary

A couple of functions are written to provide different ways of grouping the table and produce a sales history depending on conditions.

The getSalesHistory(table,norm=True,multi=True) function returns a dictionary of the sales in the table, filtered according to its parameters by the functions above. The multi parameter means only parcels with multiple sales in their histories are returned.

Below, a sale history dictionary is created. Norm is set to False so that only sales occurring after the year 1999 are included, and multi is set to True so that only parcels with more than one sale are included. In short, these two conditions mean that parcels in the dictionary will contain only post-1999 sales and will have multiple recorded sales.

6.2 Which Parcels have increasing sales?

Below is a collection of functions to determine properties of sale histories. The first, strictincr(saleHistory), determines whether a given parcel's sale history has been strictly increasing. That is, sorting the sales by date from earliest to latest, the sale amounts are always increasing.

The second, finalincr(saleHistory), determines if a given parcel's sale history has gone up via the first and last sales. This will return true as long as the latest sale is larger than the earliest.
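Minimal versions of the two checks are sketched below, assuming (as an illustration) that a sale history is a list of (sale_date, sale_amount) pairs:

```python
def strictincr(sale_history):
    # Sort by date, then require every consecutive pair to increase
    amounts = [amt for _, amt in sorted(sale_history)]
    return all(a < b for a, b in zip(amounts, amounts[1:]))

def finalincr(sale_history):
    # Only compare the earliest and latest sale amounts
    amounts = [amt for _, amt in sorted(sale_history)]
    return amounts[-1] > amounts[0]
```

A history that dips in the middle but ends higher fails the strict check while passing the final check.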

Two Series, strictTab and finalTab, are created from the two functions above. Both are boolean Series: one indicates whether a parcel's price has strictly increased, while the other only checks whether the last sale is greater than the first.

6.3 Producing a Map of percent changes

The above functions can be adapted into applied functions that will produce a series that shows percent change in between the first and last sales of a parcel. incrPercent(row,saleHistory) will divide the difference between the last and first sales by the first sale and multiply by 100 to produce a percentage. incrPercentDay(row,saleHistory) will perform the same calculation, but will divide the value by the number of days elapsed between the two sales. This serves to normalize for time.
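The two percent-change calculations can be sketched as below, again assuming a sale history is a date-sorted list of (sale_date, sale_amount) pairs (a simplification of the applied row-wise functions):

```python
from datetime import date

def incr_percent(sale_history):
    # Percent change between the first and last sale amounts
    first, last = sale_history[0][1], sale_history[-1][1]
    return (last - first) / first * 100

def incr_percent_day(sale_history):
    # Same percent change, normalized by the days elapsed between the sales
    days = (sale_history[-1][0] - sale_history[0][0]).days
    return incr_percent(sale_history) / days

history = [(date(2000, 1, 1), 200000), (date(2010, 1, 1), 300000)]
```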

Since finalTab from section 6.2 is a Series indicating which parcels have increased from first to last sale, it is fed into the two percent-change functions and the resulting two Series are added to the DataFrame. This is then merged with geoSalesResid (after sale dates before 2000 have been filtered out) to produce geoSalesResidPercent.

New color functions have to be chosen as the numerical values differ from sales prices.

  1. chooseColorPerc(sale): colors for percent change between first/last sale
  2. chooseColorPercNorm(sale): colors for normalized percent change
  3. addMarkersPerc(row,currmap,norm,group): similar to addMarkersSaleGroup(row,currmap,group), with the exception that it must also keep track of whether or not the value has been normalized.

The generatePercentMap(salesTable) function will do the following:

  1. Use pointinpolygons function on salesTable, producing a Series that categorizes each parcel into a JSON region. Insert that Series into the salesTable (points in no regions are given "0").
  2. Group the table by the regions categorized and determine the mean
  3. Make it a DataFrame with the numerical indices as a column
  4. Load JSON file of planning areas and update each layer with the calculated average
  5. Load JSON file of city boundary and add median/average sale to JSON
  6. Create a Folium Map and include additional Tile options
  7. FeatureGroup: Add a GeoJSON of city boundary that includes average/median of whole city
  8. FeatureGroup: Add the markers of individual parcels onto the map
  9. FeatureGroup: Add a Choropleth using DataFrame from #3 and JSON from #4 based on averages
  10. FeatureGroup: Add a Choropleth using DataFrame from #3 and JSON from #4 based on median
  11. Add LayerControl and save map